Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GC: fix incorrect paths when listing prefixes with ADLS #8661

Merged
merged 6 commits into from
May 30, 2024

Conversation

adutra
Copy link
Contributor

@adutra adutra commented May 29, 2024

Fixes #8659.

The core of the issue is that ADLS returns relative paths in response to a prefix-list operation, while S3 and GCS return absolute paths.

Fixing that issue uncovered a second issue: the BlobStorageException with error code PathNotFound wasn't being flagged as a not-found error.

A third issue was that we had multiple SLF4J bindings on the classpath, so no log messages were printed during tests. I fixed that too.

It was extremely hard to diagnose the issue given how difficult it is to test with ADLS:

  1. ITSparkIcebergNessieAzure had issues with incompatible Netty versions on the classpath; I fixed these, but it seems Azurite is not compatible with ADLS v2 list-prefix REST endpoint, so it's basically unusable.
  2. I introduced ITSparkIcebergNessieAzureMock that uses @snazy excellent mock server, and that worked once I got past two more issues:
    a. the serializability issue in AzureProperties (PR by @snazy still not merged).
    b. an issue with the list-prefix REST endpoint response – there was un undocumented required field (!!).

Basically if you want to test:

  1. Checkout @snazy branch Make AzureProperties w/ shared-key creds serializable apache/iceberg#10045;
  2. Publish your branch to Maven local: ./gradlew clean publishToMavenLocal -DsparkVersions=3.3,3.4,3.5;
  3. Change the iceberg version in gradle/libs.versions.toml;
  4. Remove the @Disabled annotation on ITSparkIcebergNessieAzureMock;

Then run:

./gradlew :nessie-gc-iceberg-inttest:intTest -DwithMavenLocal=true --tests org.projectnessie.gc.iceberg.inttest.ITSparkIcebergNessieAzureMock

One last issue that I decided not to fix is an issue with special chars (test case 3): I think that ADLSFileIO is inherently broken since it passes blob names unencoded to the ADLS client, but the client attempts to URL-decode the name, see com.azure.storage.blob.specialized.BlobAsyncClientBase. This blows up with:

URLDecoder: Illegal hex characters in escape (%) pattern - Error at index 0 in: "^&"
java.lang.IllegalArgumentException: URLDecoder: Illegal hex characters in escape (%) pattern - Error at index 0 in: "^&"
	at java.base/java.net.URLDecoder.decode(URLDecoder.java:237)
	at java.base/java.net.URLDecoder.decode(URLDecoder.java:147)
	at com.azure.storage.common.Utility.decode(Utility.java:88)
	at com.azure.storage.common.Utility.urlDecode(Utility.java:55)
	at com.azure.storage.blob.specialized.BlobAsyncClientBase.<init>(BlobAsyncClientBase.java:242)

f -> {
StorageUri location = StorageUri.of(f.location());
if (!location.isAbsolute()) {
location = basePath.resolve("/").resolve(location);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't basePath.relativize(location) below equal to location in this case?
(Maybe I'm too tired ATM tho)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heh no :-)

Here is a typical case with ADLS:

basePath  = abfs://[email protected]/warehouse/ns/table/
f.location() = warehouse/ns/table/whatever.parquet

So basePath.relativize(location) would yield warehouse/ns/table/whatever.parquet which would be then resolved against basePath giving a wrong result:

abfs://[email protected]/warehouse/ns/table/warehouse/ns/table/whatever.parquet

So we need to first take the "root" URI:

basePath.resolve("/") = abfs://[email protected]/

Then resolve location:

root.resolve(location) = abfs://[email protected]/warehouse/ns/table/whatever.parquet


// Enforce a single version of Netty among dependencies
// (Spark, Hadoop and Azure)
implementation(platform("io.netty:netty-bom:4.1.110.Final"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renovate will bump this one. You'd need a version range or make the version explicit in another way using "Gradle magic" (IIRC nessie-client built script has a few of those).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's OK to bump this version, as long as one single version is used. If it ever happens that a new Netty version becomes incompatible with one of the components using it, then we'd need to force it – but for now, I think that's fine?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh - then maybe just add it to gradle/libs.versions.toml.
Maybe also make it an enforcedPlatform, which enforces the exact versions of that bom, whereas versions from a "plain" platform can be bumped.

Copy link
Member

@snazy snazy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that the behavior of Iceberg's ADLS FileIO implmentation behaves differently from the S3, GCS and Haddop FileIO implementations.

S3 + GCS returns fully qualified URIs (s3://bucket/path/path/path/file) in FileInfo but ADLs the only full path, but as a relative URI.

I wonder what else would break in Iceberg, given that the ADLSFileIO.listPrefix() has a different behavior.

@snazy
Copy link
Member

snazy commented May 30, 2024

FYI adutra#1

@kbobrien
Copy link

I built a docker container locally using this fix and deployed it in our K8S lab to test against real data on Azure ADLSv2. Everything worked as expected and the files are now getting deleted.

Finished Nessie-GC identify phase finished with status IDENTIFY_SUCCESS after PT3.404244S, live-content-set ID is 5f2cf659-2941-44e0-9642-9461b786d40c.
2024-05-30 13:26:40,025 [main] INFO  o.p.g.e.local.DefaultLocalExpire - live-set#5f2cf659-2941-44e0-9642-9461b786d40c: Starting expiry.
2024-05-30 13:26:40,216 [ForkJoinPool-3-worker-4] INFO  org.apache.iceberg.CatalogUtil - Loading custom FileIO implementation: org.apache.iceberg.azure.adlsv2.ADLSFileIO
2024-05-30 13:26:40,522 [ForkJoinPool-3-worker-4] WARN  c.a.c.h.netty.implementation.Utility - The following Netty dependencies have versions that do not match the versions specified in the azure-core-http-netty pom.xml file. This may result in unexpected behavior. If your application runs without issue this message can be ignored, otherwise please update the Netty dependencies to match the versions specified in the pom.xml file. Versions found in runtime: 'io.netty:netty-common' version not found (expected: 4.1.101.Final),'io.netty:netty-handler' version not found (expected: 4.1.101.Final),'io.netty:netty-handler-proxy' version not found (expected: 4.1.101.Final),'io.netty:netty-buffer' version not found (expected: 4.1.101.Final),'io.netty:netty-codec' version not found (expected: 4.1.101.Final),'io.netty:netty-codec-http' version not found (expected: 4.1.101.Final),'io.netty:netty-codec-http2' version not found (expected: 4.1.101.Final),'io.netty:netty-transport-native-epoll' version: 4.1.100.Final (expected: 4.1.101.Final)
2024-05-30 13:26:53,614 [ForkJoinPool-3-worker-2] INFO  o.p.g.expire.PerContentDeleteExpired - live-set#5f2cf659-2941-44e0-9642-9461b786d40c content#46b37299-d0e2-4d35-a022-c98a8bdbce50: Found 5 total files in base location abfs://<REDACTED>@<REDACTED>.dfs.core.windows.net/iceberg/warehouse/t1_4b441898-d38a-4136-8cb8-612ffe2709e9/, 0 files considered expired, 5 files considered live, 0 files are newer than max-file-modification-time.
2024-05-30 13:26:53,708 [ForkJoinPool-3-worker-4] INFO  o.p.g.expire.PerContentDeleteExpired - live-set#5f2cf659-2941-44e0-9642-9461b786d40c content#f0746e53-4bc7-4419-8433-a0fcef1f6e46: Found 5 total files in base location abfs://<REDACTED>@<REDACTED>.dfs.core.windows.net/iceberg/warehouse/t1_555eb54d-8179-4252-a928-5b1ab20ced76/, 0 files considered expired, 5 files considered live, 0 files are newer than max-file-modification-time.
2024-05-30 13:26:53,708 [ForkJoinPool-3-worker-1] INFO  o.p.g.expire.PerContentDeleteExpired - live-set#5f2cf659-2941-44e0-9642-9461b786d40c content#e3b82ab0-18e5-4e7b-9fdd-fb96c3b3ab25: Found 5 total files in base location abfs://<REDACTED>@<REDACTED>.dfs.core.windows.net/warehouse/warehouse/t1_e1db2200-f9c1-48d5-bf38-ecf94781b351/, 0 files considered expired, 5 files considered live, 0 files are newer than max-file-modification-time.
2024-05-30 13:26:53,711 [ForkJoinPool-3-worker-3] INFO  o.p.g.expire.PerContentDeleteExpired - live-set#5f2cf659-2941-44e0-9642-9461b786d40c content#f74cad7b-8237-468b-8d7e-8dc284deb49f: Found 6 total files in base location abfs://<REDACTED>@<REDACTED>.dfs.core.windows.net/iceberg/warehouse/raw_101_298c61ed-902d-4f4c-8347-dca3840cb0c9/, 0 files considered expired, 6 files considered live, 0 files are newer than max-file-modification-time.
2024-05-30 13:26:55,011 [ForkJoinPool-3-worker-2] INFO  o.p.g.expire.PerContentDeleteExpired - live-set#5f2cf659-2941-44e0-9642-9461b786d40c content#76cb03c9-f10a-4d91-984d-b5d19a804afd: Found 5 total files in base location abfs://<REDACTED>@<REDACTED>.dfs.core.windows.net/warehouse/t1_2e8133c8-5c5a-4f0f-8f79-754939bb10ab/, 0 files considered expired, 5 files considered live, 0 files are newer than max-file-modification-time.
2024-05-30 13:26:55,212 [ForkJoinPool-3-worker-4] INFO  o.p.g.expire.PerContentDeleteExpired - live-set#5f2cf659-2941-44e0-9642-9461b786d40c content#3242292d-dcb0-4e4b-9f69-ef04b23f19c5: Found 5 total files in base location abfs://<REDACTED>@<REDACTED>.dfs.core.windows.net/iceberg/warehouse/t1_90b3bef1-c067-4d7d-b78d-6363441a176e/, 0 files considered expired, 5 files considered live, 0 files are newer than max-file-modification-time.
2024-05-30 13:27:08,906 [ForkJoinPool-3-worker-1] INFO  o.p.g.expire.PerContentDeleteExpired - live-set#5f2cf659-2941-44e0-9642-9461b786d40c content#a1c70148-c8b2-4a52-8a05-f3f4378283f7: Found 43 total files in base location abfs://<REDACTED>@<REDACTED>.dfs.core.windows.net/iceberg/warehouse/raw_workflow_events_c48dca5b-271f-4a38-9841-1d49170ecf0d/, 0 files considered expired, 43 files considered live, 0 files are newer than max-file-modification-time.
2024-05-30 13:27:22,209 [ForkJoinPool-3-worker-3] INFO  o.p.g.expire.PerContentDeleteExpired - live-set#5f2cf659-2941-44e0-9642-9461b786d40c content#2a5db5b8-16b4-45d3-b627-7ebb03b71491: Found 85 total files in base location abfs://<REDACTED>@<REDACTED>.dfs.core.windows.net/iceberg/warehouse/x_events_25ca607a-0a26-4c36-bb03-09850ebd0331/, 0 files considered expired, 85 files considered live, 0 files are newer than max-file-modification-time.
2024-05-30 13:27:22,211 [main] INFO  o.p.g.e.local.DefaultLocalExpire - live-set#5f2cf659-2941-44e0-9642-9461b786d40c: Expiry finished, took PT42.185851937S, deletion summary: DeleteSummary{deleted=0, failures=0}.
Nessie-GC sweep phase for live-content-set 5f2cf659-2941-44e0-9642-9461b786d40c finished with status EXPIRY_SUCCESS after PT3.404244S, deleted 0 files, 0 files could not be deleted.

Built container using...

tools/dockerbuild/build-push-images.sh \
 --dockerfile Dockerfile-gctool \
 --gradle-project :nessie-gc-tool \
 --project-dir gc/gc-tool \
 --local \
 local/nessie-gc

@adutra adutra merged commit 7cacd69 into projectnessie:main May 30, 2024
16 checks passed
@adutra adutra deleted the gc-adls-fix branch May 30, 2024 16:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants